Convergence rates of Markov chains have been widely studied in
recent years. In particular, quantitative bounds on convergence rates
have been studied in various forms by Meyn and Tweedie [Ann. Appl.
Probab. 4 (1994) 981-1101], Rosenthal [J. Amer. Statist. Assoc. 90
(1995) 558-566], Roberts and Tweedie [Stochastic Process. Appl. 80
(1999) 211-229], Jones and Hobert [Statist. Sci. 16 (2001) 312-334]
and Fort [Ph.D. thesis (2001) Univ. Paris VI]. In this paper, we extend
a result of Rosenthal [J. Amer. Statist. Assoc. 90 (1995) 558-566]
that concerns quantitative convergence rates for time-homogeneous
Markov chains. Our extension allows us to consider $f$-total variation
distance (instead of total variation) and time-inhomogeneous Markov
chains. We apply our results to simulated annealing.

1. Time-homogeneous case. 

1.1. Introduction. Let $P$ be a Markov transition kernel defined on a general
state space $(X, \mathcal{B}(X))$. Denote by $P^k$ the corresponding $k$-step transition
kernel. For $\xi$ a probability measure on $\mathcal{B}(X)$ and $f$ a Borel function, define

$$\xi P(A) = \int \xi(dy) P(y,A) \quad\text{and}\quad Pf(x) = \int P(x,dy) f(y).$$

For $f : X \to [1,\infty)$, the $f$-total variation or $f$-norm of a signed measure $\mu$
on $\mathcal{B}(X)$ is defined as

$$\|\mu\|_f := \sup_{|\phi| \le f} |\mu(\phi)|.$$

When $f \equiv 1$, the $f$-norm is the total variation norm, which is denoted $\|\mu\|_{TV}$.
Our goal is to find explicit bounds on rates of convergence of $\|\xi P^n - \xi' P^n\|_f$
to zero. In the special case in which $P$ has a stationary distribution $\pi$,
this corresponds to bounding the convergence of $\xi P^n$ to $\pi$. Our results
extend and sharpen the nonquantitative results developed, for example,
by Meyn and Tweedie [(1993), Chapters 15 and 16], where one typically
finds conditions under which there exists some rate function $r(n)$ such that
$r(n)\|P^n(x,\cdot) - \pi\|_f \to 0$ as $n \to \infty$.
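To make the $f$-norm concrete, here is a minimal sketch on a finite state space, where a signed measure is just a vector and a kernel is a row-stochastic matrix. The function names (`f_norm`, `push_forward`, `distance_after`) are illustrative choices, not notation from the paper. Note that with this definition the total variation norm ($f \equiv 1$) is the full $\ell^1$ distance, not half of it.

```python
from typing import List, Sequence


def f_norm(mu: Sequence[float], f: Sequence[float]) -> float:
    """f-norm of a signed measure mu on a finite set {0, ..., m-1}.

    On a finite set the supremum of |mu(phi)| over |phi| <= f is attained
    at phi = f * sign(mu), so the norm equals sum_x f(x) * |mu(x)|.
    """
    if len(mu) != len(f):
        raise ValueError("mu and f must have the same length")
    if any(w < 1.0 for w in f):
        raise ValueError("f must take values in [1, infinity)")
    return sum(w * abs(m) for m, w in zip(mu, f))


def push_forward(xi: Sequence[float], P: Sequence[Sequence[float]]) -> List[float]:
    """Return the row vector xi P, i.e. (xi P)(x) = sum_y xi(y) P(y, x)."""
    return [sum(xi[y] * P[y][x] for y in range(len(P))) for x in range(len(P[0]))]


def distance_after(n: int, xi: Sequence[float], xi_prime: Sequence[float],
                   P: Sequence[Sequence[float]], f: Sequence[float]) -> float:
    """Compute || xi P^n - xi' P^n ||_f by iterating the kernel n times."""
    a, b = list(xi), list(xi_prime)
    for _ in range(n):
        a, b = push_forward(a, P), push_forward(b, P)
    return f_norm([u - v for u, v in zip(a, b)], f)
```

For a two-state chain the distance between the two point-mass starts decays geometrically at the rate of the second eigenvalue, which is the kind of behaviour the bounds below quantify.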

The problem of getting explicit bounds on $\|P^n(x,\cdot) - \pi\|_f$ has received
much attention in recent years, motivated by control of convergence for
Markov chain Monte Carlo and operations research problems [see, e.g., Jones and Hobert
(2001)]. Most of the available results cover only total variation bounds [see
Rosenthal (1995) and Roberts and Tweedie (1999)]. To the best of our
knowledge, the only explicit bound in $f$-total variation distance was given by
Meyn and Tweedie [(1994), Theorem 2.3]. This bound is based on the
Nummelin splitting construction and depends in a very intricate way on the
constants of the kernel. In this section, we use a different approach, based
on coupling. We obtain a bound (Theorem 2) which is simple, very
generally applicable and, although not tight, does improve on the work of
Meyn and Tweedie [(1994), Theorem 2.3].

1.2. Assumptions and lemma. Let $a \wedge b = \min(a,b)$ and $a \vee b = \max(a,b)$.

To use the coupling construction, we first need a set where coupling may
occur. We make the following assumption:

(A1) There exist a set $\bar C \subset X \times X$, a constant $\varepsilon > 0$ and a family of
probability measures $\{\nu_{x,x'}, (x,x') \in \bar C\}$ on $X$ with

(1) $$P(x,A) \wedge P(x',A) \ge \varepsilon\,\nu_{x,x'}(A) \qquad \forall A \in \mathcal{B}(X),\ (x,x') \in \bar C.$$

Following Bickel and Ritov (2001), we call $\bar C$ a $(1,\varepsilon)$-coupling set. For
simplicity, only one-step minorization is considered in this paper.
Adaptations to $m$-step minorization can be carried out as in Rosenthal (1995). We
note that condition (1) is in many cases satisfied by setting $\bar C = C \times C$,
where $C$ is a so-called pseudo-small set. Recall that a subset $C \subset X$ is
$(1,\varepsilon)$-pseudo-small if there exist a constant $\varepsilon > 0$ and a family of
probability measures $\{\nu_{x,x'}, (x,x') \in C \times C\}$ with $P(x,\cdot) \wedge P(x',\cdot) \ge \varepsilon\nu_{x,x'}(\cdot)$ for all
$(x,x') \in C \times C$ [see Roberts and Rosenthal (2001)]. We stress that $C$ is a
subset of $X$ and that, despite the obvious similarity, a $(1,\varepsilon)$-pseudo-small
set is not a $(1,\varepsilon)$-coupling set. Recall finally that a set $C$ is $(1,\varepsilon)$-small if it is
$(1,\varepsilon)$-pseudo-small with the same minorizing probability measure $\nu = \nu_{x,x'}$
for all $(x,x') \in C \times C$. The primary motivation for using $(1,\varepsilon)$-coupling sets is
that the usual pairwise coupling argument can be used without change and
that, in some cases detailed below, $(1,\varepsilon)$-coupling sets can be significantly
larger than the product of $(1,\varepsilon)$-pseudo-small sets.
To introduce the coupling construction, some additional definitions are
required. Let $\bar R$ be a Markov transition kernel that satisfies, for all $(x,x') \in \bar C$
and all $A \in \mathcal{B}(X)$,

(2) $$\bar R(x,x'; A \times X) = (1-\varepsilon)^{-1}(P(x,A) - \varepsilon\nu_{x,x'}(A)), \qquad \bar R(x,x'; X \times A) = (1-\varepsilon)^{-1}(P(x',A) - \varepsilon\nu_{x,x'}(A)).$$

For example, we can set, for $(x,x') \in \bar C$,

$$\bar R(x,x'; A \times A') = \big((1-\varepsilon)^{-1}(P(x,A) - \varepsilon\nu_{x,x'}(A))\big) \times \big((1-\varepsilon)^{-1}(P(x',A') - \varepsilon\nu_{x,x'}(A'))\big),$$

but other trickier constructions may also be considered. Similarly, let $\bar P$
be a Markov transition kernel on $X \times X$ such that, for $(x,x') \in \bar C$ and all
$A, A' \in \mathcal{B}(X)$,

(3) $$\bar P(x,x'; A \times A') = (1-\varepsilon)\bar R(x,x'; A \times A') + \varepsilon\nu_{x,x'}(A \cap A'),$$

and satisfies, for $(x,x') \notin \bar C$ and all $A \in \mathcal{B}(X)$,

(4) $$\bar P(x,x'; A \times X) = P(x,A) \quad\text{and}\quad \bar P(x,x'; X \times A) = P(x',A).$$

For example, we can once again set, for $(x,x') \notin \bar C$, $\bar P(x,x'; A \times A') = P(x,A)P(x',A')$,
to get that $\bar P$ satisfies (4) for all $(x,x') \in X \times X$.

Define the product space $Z = X \times X \times \{0,1\}$ and the associated product
sigma algebra $\mathcal{B}(Z)$. We define on the space $(Z^{\mathbb{N}}, \mathcal{B}(Z)^{\otimes\mathbb{N}})$ a Markov chain
$\{Z_n := (X_n, X_n', d_n), n \ge 0\}$. Indeed, given $Z_n$ we construct $Z_{n+1}$ as follows.

If $d_n = 1$, then draw $X_{n+1} \sim P(X_n, \cdot)$, and set $X_{n+1}' = X_{n+1}$ and $d_{n+1} = 1$. If
$d_n = 0$ and $(X_n, X_n') \in \bar C$, flip a coin with probability of heads $\varepsilon$. If the coin
comes up heads, then draw $X_{n+1}$ from $\nu_{X_n,X_n'}$ and set $X_{n+1}' = X_{n+1}$
and $d_{n+1} = 1$. If the coin comes up tails, then draw $(X_{n+1}, X_{n+1}')$ from
the residual kernel $\bar R(X_n, X_n'; \cdot)$ and set $d_{n+1} = 0$. If $d_n = 0$ and $(X_n, X_n') \notin \bar C$,
then draw $(X_{n+1}, X_{n+1}')$ according to the kernel $\bar P(X_n, X_n'; \cdot)$ and set
$d_{n+1} = 0$. Here $d_n$ is called a bell variable; it indicates whether the chains
have coupled ($d_n = 1$) or not ($d_n = 0$) by time $n$.
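The construction above can be sketched on a finite state space. This is only an illustration under extra assumptions not made in the text: the state space is $\{0,\dots,m-1\}$, the coupling set is taken to be all of $X \times X$, $\varepsilon$ is the smallest overlap mass between two rows of $P$, $\nu_{x,x'}$ is the normalized overlap of rows $x$ and $x'$, and the residual kernel is the product form given as an example after (2). All function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[int, int]


def overlap(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Pointwise minimum of two probability vectors."""
    return [min(a, b) for a, b in zip(p, q)]


def minorization(P: Sequence[Sequence[float]]) -> Tuple[float, Dict[Pair, List[float]]]:
    """Return (eps, nu) with P(x,.) ^ P(x',.) >= eps * nu[(x, x')] for all pairs."""
    eps, nu = 1.0, {}
    for x in range(len(P)):
        for xp in range(len(P)):
            m = overlap(P[x], P[xp])
            mass = sum(m)
            if mass <= 0.0:
                raise ValueError("two rows have disjoint supports; shrink the coupling set")
            eps = min(eps, mass)
            nu[(x, xp)] = [v / mass for v in m]
    return eps, nu


def residual(P, eps: float, nu, state: int, pair: Pair) -> List[float]:
    """Marginal residual law (P(state,.) - eps*nu_pair) / (1 - eps); needs eps < 1."""
    return [max(0.0, (P[state][y] - eps * nu[pair][y]) / (1.0 - eps))
            for y in range(len(P))]


def draw(rng: random.Random, probs: Sequence[float]) -> int:
    """Sample an index with the given (unnormalized, nonnegative) weights."""
    return rng.choices(range(len(probs)), weights=probs)[0]


def coupled_step(rng: random.Random, P, eps: float, nu,
                 x: int, xp: int, d: int) -> Tuple[int, int, int]:
    """One transition of (X_n, X_n', d_n) with every pair in the coupling set."""
    if d == 1:                      # already coupled: move together
        y = draw(rng, P[x])
        return y, y, 1
    if rng.random() < eps:          # heads: both chains jump to a draw from nu
        y = draw(rng, nu[(x, xp)])
        return y, y, 1
    # tails: independent draws from the two residual marginals
    y = draw(rng, residual(P, eps, nu, x, (x, xp)))
    yp = draw(rng, residual(P, eps, nu, xp, (x, xp)))
    return y, yp, 0
```

Each marginal is still updated according to $P$, since $\varepsilon\nu_{x,x'} + (1-\varepsilon)$ times the residual law recovers the corresponding row of $P$.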

For $\mu$ a probability measure on $\mathcal{B}(Z)$, denote by $P_\mu$ the probability
measure induced on $(Z^{\mathbb{N}}, \mathcal{B}(Z)^{\otimes\mathbb{N}})$ by the Markov chain $(Z_n, n \ge 0)$ with initial
distribution $\mu$. The corresponding expectation operator is denoted by $E_\mu$. It
is then easily checked that $(X_n, n \ge 0)$ and $(X_n', n \ge 0)$ are each marginally
updated according to the transition kernel $P$; that is, for any $n$, for any
initial distributions $\xi$ and $\xi'$, and for any $A, A' \in \mathcal{B}(X)$,

(5) $$P_{\xi\otimes\xi'\otimes\delta_0}(Z_n \in A \times X \times \{0,1\}) = \xi P^n(A), \qquad P_{\xi\otimes\xi'\otimes\delta_0}(Z_n \in X \times A' \times \{0,1\}) = \xi' P^n(A'),$$

where $\delta_x$ is the Dirac measure centered on $x$ and $\otimes$ is the tensor product of
measures. Define the coupling time $T = \inf\{k \ge 1 \mid d_k = 1\}$ (with the
convention $\inf\emptyset = \infty$). Let $P^*$ be the Markov kernel defined, for all $(x,x') \in X \times X$
and all $A \in \mathcal{B}(X \times X)$, by

(6) $$P^*(x,x'; A) = \begin{cases} \bar P(x,x'; A), & \text{if } (x,x') \notin \bar C, \\ \bar R(x,x'; A), & \text{if } (x,x') \in \bar C. \end{cases}$$

For $\mu$ a probability measure on $X \times X$, denote by $P^*_\mu$ and $E^*_\mu$ the probability
and the expectation induced by the Markov chain on $X \times X$ with initial
distribution $\mu$ and transition kernel $P^*$.


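Lemma 1. Assume (A1). Then, for any $n \ge 0$ and any nonnegative
Borel function $\phi : (X \times X)^{n+1} \to \mathbb{R}^+$, we have

$$E_{\xi\otimes\xi'\otimes\delta_0}\{\phi(\bar X_0, \dots, \bar X_n) 1(d_n = 0)\} = E^*_{\xi\otimes\xi'}\{\phi(\bar X_0, \dots, \bar X_n)(1-\varepsilon)^{N_{n-1}}\},$$

where $\bar X_i := (X_i, X_i')$, $N_i := \sum_{j=0}^{i} 1_{\bar C}(\bar X_j)$ and $N_{-1} := 0$.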


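Proof. We first verify that the result holds for all functions $\phi(\bar x_0, \dots, \bar x_n) = \prod_{i=0}^{n} \psi_i(\bar x_i)$,
where $\bar x_i := (x_i, x_i')$ and $(\psi_i, i \ge 0)$ are nonnegative Borel functions
on $X \times X$. The proof is by induction. For $n = 0$, the result is obvious.
Assume that the result holds up to order $n - 1$ for some $n \ge 1$. We have

$$E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n} \psi_i(\bar X_i) 1(d_n = 0)\Big\} = E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C^c}(\bar X_{n-1}) \psi_n(\bar X_n) 1(d_n = 0)\Big\} + E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C}(\bar X_{n-1}) \psi_n(\bar X_n) 1(d_n = 0)\Big\},$$

where $\bar C^c := (X \times X) \setminus \bar C$. Define $\mathcal{F}_k = \sigma\{Z_i = (\bar X_i, d_i), 0 \le i \le k\}$. Note that, for
$n \ge 1$,

$$E\{\psi_n(\bar X_n) 1(d_n = 0) \mid \mathcal{F}_{n-1}\} 1_{\bar C^c}(\bar X_{n-1}) 1(d_{n-1} = 0) = \bar P\psi_n(\bar X_{n-1}) 1_{\bar C^c}(\bar X_{n-1}) 1(d_{n-1} = 0).$$

Since $N_{n-2} 1_{\bar C^c}(\bar X_{n-1}) = N_{n-1} 1_{\bar C^c}(\bar X_{n-1})$ and $\bar P(x,x'; \cdot) = P^*(x,x'; \cdot)$ for $(x,x') \notin \bar C$,
we have, under the induction assumption,

(7) $$\begin{aligned} E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C^c}(\bar X_{n-1}) \psi_n(\bar X_n) 1(d_n = 0)\Big\} &= E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C^c}(\bar X_{n-1}) \bar P\psi_n(\bar X_{n-1}) 1(d_{n-1} = 0)\Big\} \\ &= E^*_{\xi\otimes\xi'}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C^c}(\bar X_{n-1}) P^*\psi_n(\bar X_{n-1}) (1-\varepsilon)^{N_{n-2}}\Big\} \\ &= E^*_{\xi\otimes\xi'}\Big\{\prod_{i=0}^{n} \psi_i(\bar X_i) 1_{\bar C^c}(\bar X_{n-1}) (1-\varepsilon)^{N_{n-1}}\Big\}. \end{aligned}$$

Similarly, note that

$$E\{1(d_n = 0) \psi_n(\bar X_n) \mid \mathcal{F}_{n-1}\} 1_{\bar C}(\bar X_{n-1}) 1(d_{n-1} = 0) = (1-\varepsilon) \bar R\psi_n(\bar X_{n-1}) 1_{\bar C}(\bar X_{n-1}) 1(d_{n-1} = 0).$$

Since $(N_{n-2} + 1) 1_{\bar C}(\bar X_{n-1}) = N_{n-1} 1_{\bar C}(\bar X_{n-1})$ and $\bar R(x,x'; \cdot) = P^*(x,x'; \cdot)$ for
all $(x,x') \in \bar C$, the induction assumption implies

(8) $$\begin{aligned} E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C}(\bar X_{n-1}) \psi_n(\bar X_n) 1(d_n = 0)\Big\} &= (1-\varepsilon) E_{\xi\otimes\xi'\otimes\delta_0}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C}(\bar X_{n-1}) \bar R\psi_n(\bar X_{n-1}) 1(d_{n-1} = 0)\Big\} \\ &= E^*_{\xi\otimes\xi'}\Big\{\prod_{i=0}^{n-1} \psi_i(\bar X_i) 1_{\bar C}(\bar X_{n-1}) P^*\psi_n(\bar X_{n-1}) (1-\varepsilon)^{N_{n-2}+1}\Big\} \\ &= E^*_{\xi\otimes\xi'}\Big\{\prod_{i=0}^{n} \psi_i(\bar X_i) 1_{\bar C}(\bar X_{n-1}) (1-\varepsilon)^{N_{n-1}}\Big\}. \end{aligned}$$

Thus, the two measures on $\mathcal{B}((X \times X)^{n+1})$ defined, respectively, by

$$A \mapsto E_{\xi\otimes\xi'\otimes\delta_0}\{1_A(\bar X_0, \dots, \bar X_n) 1(d_n = 0)\} \quad\text{and}\quad A \mapsto E^*_{\xi\otimes\xi'}\{1_A(\bar X_0, \dots, \bar X_n)(1-\varepsilon)^{N_{n-1}}\}$$

are equal on the monotone class $\mathcal{C} = \{A : A = A_0 \times \cdots \times A_n,\ A_i \in \mathcal{B}(X \times X)\}$
and thus these two measures coincide on the product sigma algebra, which
concludes the proof. □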


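1.3. Main time-homogeneous result. Let $f : X \to [1,\infty)$ and let $\phi : X \to \mathbb{R}$
be any Borel function such that $\sup_{x \in X} |\phi(x)|/f(x) < \infty$. Using (5), the
classical coupling inequality [see, e.g., Thorisson (2000), Chapter 2, Section 3]
implies that

$$\begin{aligned} |\xi P^n\phi - \xi' P^n\phi| &= |E_{\xi\otimes\xi'\otimes\delta_0}\{\phi(X_n) - \phi(X_n')\}| \\ &= |E_{\xi\otimes\xi'\otimes\delta_0}\{(\phi(X_n) - \phi(X_n')) 1(d_n = 0)\}| \\ &\le \Big(\sup_{x \in X} |\phi(x)|/f(x)\Big) E_{\xi\otimes\xi'\otimes\delta_0}\{(f(X_n) + f(X_n')) 1(d_n = 0)\}. \end{aligned}$$

By Lemma 1,

$$E_{\xi\otimes\xi'\otimes\delta_0}\{(f(X_n) + f(X_n')) 1(d_n = 0)\} = E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}}\}.$$

Thus, the following key coupling inequality holds:

(9) $$|\xi P^n\phi - \xi' P^n\phi| \le \Big(\sup_{x \in X} |\phi(x)|/f(x)\Big) E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}}\}.$$

To bound the term on the right-hand side of (9), we need a drift condition
outside $\bar C$ for the kernel $P^*$:

(A2) There exist a function $\bar V : X \times X \to [1,\infty)$ and constants $b$ and $\lambda$,
$0 < \lambda < 1$, such that

(10) $$P^*\bar V \le \lambda\bar V + b 1_{\bar C}.$$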

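Theorem 2. Assume (A1) and (A2). Let $f : X \to [1,\infty)$ be a function
which satisfies $f(x) + f(x') \le 2\bar V(x,x')$ for all $(x,x') \in X \times X$. Then, for all
$j \in \{1, \dots, n+1\}$ and for all initial probability measures $\xi$ and $\xi'$ on $X$,

(11) $$\|\xi P^n - \xi' P^n\|_{TV} \le 2(1-\varepsilon)^j 1(j \le n) + 2\lambda^n B^{j-1}\, \xi\otimes\xi'(\bar V),$$

(12) $$\|\xi P^n - \xi' P^n\|_f \le 2(1-\varepsilon)^j \big(b(1-\lambda)^{-1} + \lambda^n\, \xi\otimes\xi'(\bar V)\big) 1(j \le n) + 2\lambda^n B^{j-1}\, \xi\otimes\xi'(\bar V),$$

where

$$B = 1 \vee \Big((1-\varepsilon)\lambda^{-1} \sup_{(x,x') \in \bar C} \bar R\bar V(x,x')\Big).$$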


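Proof. For any $j \in \{1, \dots, n+1\}$, we have

(13) $$E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}}\} \le E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}} 1(N_{n-1} \ge j)\} + 2E^*_{\xi\otimes\xi'}\{\bar V(\bar X_n)(1-\varepsilon)^{N_{n-1}} 1(N_{n-1} < j)\}.$$

Consider the first term on the right-hand side of (13). Since $N_{n-1} \le n$, this term
vanishes when $j = n+1$, and we have

(14) $$E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}} 1(N_{n-1} \ge j)\} \le (1-\varepsilon)^j 1(j \le n)\, E^*_{\xi\otimes\xi'}\{f(X_n) + f(X_n')\}.$$

If $f \equiv 1$, then $E^*_{\xi\otimes\xi'}\{f(X_n) + f(X_n')\} = 2$. Otherwise, by repeated
application of the drift condition (A2), we have

$$(P^*)^n\bar V \le \lambda(P^*)^{n-1}\bar V + b \le \lambda^n\bar V + b\sum_{k=0}^{n-1} \lambda^k \le \lambda^n\bar V + b/(1-\lambda).$$

Since $f(x) + f(x') \le 2\bar V(x,x')$, we get

$$E^*_{\xi\otimes\xi'}\{f(X_n) + f(X_n')\} \le 2E^*_{\xi\otimes\xi'}\{\bar V(\bar X_n)\} \le 2\lambda^n\, \xi\otimes\xi'(\bar V) + 2b/(1-\lambda).$$

Consider the second term on the right-hand side of (13). Denote, for $s \ge 0$,

$$M_s := \lambda^{-s} B^{-N_{s-1}} \bar V(\bar X_s)(1-\varepsilon)^{N_{s-1}}.$$

We show that $(M_s, s \ge 0)$ is an $(\mathcal{F}, P^*_{\xi\otimes\xi'})$ supermartingale, where $\mathcal{F} :=
(\mathcal{F}_s := \sigma(\bar X_i, i \le s), s \ge 0)$. The definition of $N_s$ and the drift condition (A2)
imply

$$1_{\bar C^c}(\bar X_s) N_s = 1_{\bar C^c}(\bar X_s) N_{s-1} \quad\text{and}\quad 1_{\bar C^c}(\bar X_s) P^*\bar V(\bar X_s) \le 1_{\bar C^c}(\bar X_s) \lambda\bar V(\bar X_s).$$

Thus, we have

(15) $$\begin{aligned} E^*\{M_{s+1} \mid \mathcal{F}_s\} 1_{\bar C^c}(\bar X_s) &= \lambda^{-(s+1)} B^{-N_s} P^*\bar V(\bar X_s)(1-\varepsilon)^{N_s} 1_{\bar C^c}(\bar X_s) \\ &= \lambda^{-(s+1)} B^{-N_{s-1}} P^*\bar V(\bar X_s)(1-\varepsilon)^{N_{s-1}} 1_{\bar C^c}(\bar X_s) \le M_s 1_{\bar C^c}(\bar X_s). \end{aligned}$$

By definition of $B$, $\bar R\bar V(x,x') \le \lambda(1-\varepsilon)^{-1} B$ for $(x,x') \in \bar C$. Since by construction
$1_{\bar C} P^*\bar V = 1_{\bar C} \bar R\bar V$, we have

$$E^*\{\bar V(\bar X_{s+1}) \mid \mathcal{F}_s\} 1_{\bar C}(\bar X_s) = \bar R\bar V(\bar X_s) 1_{\bar C}(\bar X_s) \le \lambda(1-\varepsilon)^{-1} B\, 1_{\bar C}(\bar X_s).$$

Since $1_{\bar C}(\bar X_s) N_s = 1_{\bar C}(\bar X_s)(N_{s-1} + 1)$ and $\bar V \ge 1$, we have

(16) $$E^*\{M_{s+1} \mid \mathcal{F}_s\} 1_{\bar C}(\bar X_s) \le \lambda^{-(s+1)} B^{-N_{s-1}-1}(1-\varepsilon)^{N_{s-1}+1} \lambda(1-\varepsilon)^{-1} B\, 1_{\bar C}(\bar X_s) \le M_s 1_{\bar C}(\bar X_s).$$

Equations (15) and (16) show that $(M_s, s \ge 0)$ is an $(\mathcal{F}, P^*_{\xi\otimes\xi'})$
supermartingale. By the optional stopping theorem, $E^*_{\xi\otimes\xi'}\{M_n\} \le E^*_{\xi\otimes\xi'}\{M_0\}$. Since
$B \ge 1$, we have $1(N_{n-1} < j) \le B^{j-1-N_{n-1}}$, which implies

(17) $$E^*_{\xi\otimes\xi'}\{\bar V(\bar X_n)(1-\varepsilon)^{N_{n-1}} 1(N_{n-1} < j)\} \le \lambda^n B^{j-1} E^*_{\xi\otimes\xi'}\{\lambda^{-n} B^{-N_{n-1}} \bar V(\bar X_n)(1-\varepsilon)^{N_{n-1}}\} \le \lambda^n B^{j-1}\, \xi\otimes\xi'(\bar V).$$

By combining (14) and (17) for $f \equiv 1$, we have

$$E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}}\} \le 2(1-\varepsilon)^j 1(j \le n) + 2\lambda^n B^{j-1}\, \xi\otimes\xi'(\bar V)$$

and (11) follows from (9). Similarly, for $f$ such that $f(x) + f(x') \le 2\bar V(x,x')$,
we have

$$E^*_{\xi\otimes\xi'}\{(f(X_n) + f(X_n'))(1-\varepsilon)^{N_{n-1}}\} \le 2(1-\varepsilon)^j \big(\lambda^n\, \xi\otimes\xi'(\bar V) + b/(1-\lambda)\big) 1(j \le n) + 2\lambda^n B^{j-1}\, \xi\otimes\xi'(\bar V)$$

and (12) follows from (9). □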
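Since the total variation bound of Theorem 2 holds for every $j \in \{1,\dots,n+1\}$, in practice one evaluates it for each $j$ and keeps the smallest value. The sketch below does this; the function name and the argument `v0`, standing for $\xi\otimes\xi'(\bar V)$, are illustrative, and the constants $\varepsilon$, $\lambda$, $B$ are assumed to have been obtained separately from (A1) and (A2).

```python
import math
from typing import Tuple


def theorem2_tv_bound(n: int, eps: float, lam: float, B: float, v0: float) -> Tuple[int, float]:
    """Minimize 2(1-eps)^j 1(j<=n) + 2 lam^n B^(j-1) v0 over j in {1, ..., n+1}.

    Returns the minimizing j and the value of the bound.
    """
    if not (0.0 < eps <= 1.0 and 0.0 < lam < 1.0 and B >= 1.0 and v0 >= 1.0):
        raise ValueError("need 0 < eps <= 1, 0 < lam < 1, B >= 1 and v0 >= 1")
    best_j, best = 0, math.inf
    drift = 2.0 * lam ** n * v0          # drift term at j = 1
    for j in range(1, n + 2):
        coupling = 2.0 * (1.0 - eps) ** j if j <= n else 0.0
        total = coupling + drift
        if total < best:
            best_j, best = j, total
        if drift >= best:
            break                        # drift term only grows with j
        drift *= B                       # move from B^(j-1) to B^j
    return best_j, best
```

The early exit is safe because the second term is nondecreasing in $j$ when $B \ge 1$; it also avoids overflowing $B^{j-1}$ for large $n$.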
1.4. Application to convergence to stationarity. If $P$ has a stationary
distribution $\pi$ (i.e., if $\pi P = \pi$), then we can choose $\xi' = \pi$. Then $\pi P^n = \pi$
for all $n$ and, hence, the results (11) and (12) allow us to bound $\|\xi P^n - \pi\|_{TV}$
and $\|\xi P^n - \pi\|_f$, respectively.

To compare our result with Meyn and Tweedie (1994), Rosenthal (1995)
and Roberts and Tweedie (1999), we now derive from the explicit
expressions of the bounds provided in Theorem 2 the rate of convergence for
the total variation distance or the $f$-norm, that is, we find a bound for
$\limsup_{n\to\infty} n^{-1} \log \|P^n(x,\cdot) - \pi\|_f$. We follow the approach originally taken
by Rosenthal (1995), but we adapt the results to the expression of the bound
given in Theorem 2.

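Proposition 3. Assume (A1) and (A2), and that $\pi P = \pi$. Let $f : X \to
[1,\infty)$ be a function that satisfies $f(x) + f(x') \le 2\bar V(x,x')$ for all $(x,x') \in
X \times X$. Then, for all $x \in X$, both $\limsup_{n\to\infty} n^{-1} \log \|P^n(x,\cdot) - \pi\|_{TV}$ and
$\limsup_{n\to\infty} n^{-1} \log \|P^n(x,\cdot) - \pi\|_f$ are bounded above by

(18) $$\begin{cases} \dfrac{\log(1-\varepsilon)\log(\lambda)}{\log(1-\varepsilon) + \log(\lambda) - \log(M-\varepsilon)}, & \text{if } (M-\varepsilon)\lambda^{-1} > 1, \\[2ex] \log(\lambda), & \text{if } (M-\varepsilon)\lambda^{-1} \le 1, \end{cases}$$

where $M := \sup_{(x,x') \in \bar C} \bar P\bar V(x,x')$.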

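Proof. By definition of $\bar P$ [see (3)], for all $(x,x') \in \bar C$ we have

$$(1-\varepsilon)\bar R\bar V(x,x') + \varepsilon\int \nu_{x,x'}(dy)\bar V(y,y) = \bar P\bar V(x,x') \ge (1-\varepsilon)\bar R\bar V(x,x') + \varepsilon,$$

where we have used that $\bar V \ge 1$. Thus

$$(1-\varepsilon)\bar R\bar V(x,x') \le \bar P\bar V(x,x') - \varepsilon \le M - \varepsilon,$$

which implies

$$(1-\varepsilon) \sup_{(x,x') \in \bar C} \bar R\bar V(x,x')\lambda^{-1} \le (M-\varepsilon)\lambda^{-1}.$$

Assuming first $(M-\varepsilon)\lambda^{-1} > 1$, the bounds for total variation and $f$-norm
can be expressed, for $j \in \{1, \dots, n\}$, as

$$\|P^n(x,\cdot) - \pi\|_{TV} \le 2(1-\varepsilon)^j + 2\lambda^{n-j+1}(M-\varepsilon)^{j-1} \int \bar V(x,x')\pi(dx'),$$

$$\|P^n(x,\cdot) - \pi\|_f \le 2(1-\varepsilon)^j \Big(\frac{b}{1-\lambda} + \lambda^n \int \bar V(x,x')\pi(dx')\Big) + 2\lambda^{n-j+1}(M-\varepsilon)^{j-1} \int \bar V(x,x')\pi(dx').$$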








BOUNDS FOR INHOMOGENEOUS MARKOV CHAINS 




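The result follows by choosing

$$j = \left\lfloor \frac{-\log(\lambda)\, n}{\log((M-\varepsilon)/\lambda) - \log(1-\varepsilon)} \right\rfloor.$$

When $(M-\varepsilon)\lambda^{-1} \le 1$, we put $j = n+1$ in (11) and (12), showing that

$$\|\xi P^n - \xi' P^n\|_{TV} \le 2\lambda^n \int \bar V(x,x')\pi(dx') \quad\text{and}\quad \|\xi P^n - \xi' P^n\|_f \le 2\lambda^n \int \bar V(x,x')\pi(dx').$$

The result follows. □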
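The choice of $j$ in the proof balances the two terms $(1-\varepsilon)^j$ and $\lambda^{n-j+1}(M-\varepsilon)^{j-1}$. Writing $j \approx \alpha n$ with $\alpha = -\log\lambda / (\log((M-\varepsilon)/\lambda) - \log(1-\varepsilon))$, the resulting exponential rate is $\alpha\log(1-\varepsilon)$. The sketch below evaluates this rate; the function name is hypothetical and the inputs $\varepsilon$, $\lambda$, $M$ are assumed to be known constants from (A1), (A2) and the definition of $M$ in the proof.

```python
import math


def geometric_rate(eps: float, lam: float, M: float) -> float:
    """Upper bound on limsup n^-1 log ||P^n(x,.) - pi||, as a negative number.

    Uses j = floor(alpha * n) with alpha chosen as in the proof when
    (M - eps) / lam > 1, and j = n + 1 (rate log(lam)) otherwise.
    """
    if not (0.0 < eps < 1.0 and 0.0 < lam < 1.0 and M >= 1.0):
        raise ValueError("need 0 < eps < 1, 0 < lam < 1 and M >= 1")
    if M - eps <= lam:
        return math.log(lam)
    alpha = -math.log(lam) / (math.log((M - eps) / lam) - math.log(1.0 - eps))
    return alpha * math.log(1.0 - eps)
```

For instance, `math.exp(geometric_rate(0.5, 0.5, 1.5))` gives the per-step contraction factor implied by these constants.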

Remark 1. The bounds we find in this paper for the $f$-total variation
distance are the same as those found for the total variation distance by
Roberts and Tweedie [(1999), Theorem 2.3].

In some applications, the minorization and drift conditions (A1) and (A2)
are more naturally expressed in terms of the kernel $P$, and it is thus
required to derive the bivariate drift and minorization conditions from the
corresponding single variate conditions [Rosenthal (1995), Theorem 12, and
Roberts and Tweedie (1999), Section 5]. The crucial point here is to relate
the bivariate drift condition (A2) to a single variate drift condition. We
essentially follow Rosenthal's [(1995), Theorem 12] argument, which allows
us to construct such a drift function $\bar V$ from univariate test functions [see
Roberts and Tweedie (1999), Theorem 5.2, for a refinement of this result].
Consider the following assumption:

(S) There exist a function $V$ and a constant $c$ such that:

• The level set $C = \{x \in X : V(x) \le c\}$ is $(1,\varepsilon)$-small; that is, $P(x,\cdot) \ge
\varepsilon\nu(\cdot)$ for all $x \in C$ for some $\varepsilon > 0$ and some probability measure $\nu$.

• There exist $\lambda_c < 1$ and $b_c < \infty$ such that $PV \le \lambda_c V + b_c 1_C$ and
$\lambda_c + b_c/(1+c) < 1$.

Under (S), $\bar C = \{(x,x') : V(x) \le c, V(x') \le c\}$ is a $(1,\varepsilon)$-coupling set, that
is, for all $(x,x') \in \bar C$ and all $A \in \mathcal{B}(X)$, $P(x,A) \wedge P(x',A) \ge \varepsilon\nu(A)$. Define
the univariate residual kernel $R$ as

(19) $$R(x,A) = (1-\varepsilon)^{-1}(P(x,A) - \varepsilon\nu(A)) \qquad \forall x \in C,\ \forall A \in \mathcal{B}(X).$$

To apply Theorem 2, we need to define the kernels $\bar R$, $\bar P$ and $P^*$. Because
the drift condition is expressed on the univariate kernel $P$, we define both $\bar R$
and $\bar P$ from the corresponding univariate kernels $R$ and $P$. More precisely,
for all $A, A' \in \mathcal{B}(X)$, define

(20) $$\bar R(x,x'; A \times A') := R(x,A)R(x',A') \quad\text{if } (x,x') \in \bar C,$$

(21) $$\bar P(x,x'; A \times A') := P(x,A)P(x',A') \quad\text{if } (x,x') \notin \bar C.$$

These kernels satisfy (2) and (4).

Proposition 4. Assume (S). Then (A1) is satisfied with $\bar C = C \times C$
and $\nu_{x,x'} = \nu$ for all $(x,x') \in C \times C$. Define $P^*$ as in (6) with $\bar R$ and $\bar P$ given
in (20) and (21). Then (A2) is satisfied with $\bar V(x,x') = (1/2)(V(x) + V(x'))$
for all $(x,x') \in X \times X$ with



Proof. The proof follows from Roberts and Tweedie [(1999),
Theorem 5.2]. Since, for $(x,x') \notin \bar C$, $(1+c)/2 \le \bar V(x,x')$, we have




and, for $(x,x') \in \bar C$,

$$P^*\bar V(x,x') = \tfrac{1}{2}(RV(x) + RV(x'))$$





where we have used that, for $(x,x') \in \bar C$, $\bar V(x,x') \le c$. The proof follows. □

Under (S), we may thus apply Theorem 2 with $f = V$, which yields explicit
bounds for the total variation and the $V$-norm, under the assumptions used
by Rosenthal (1995) and Roberts and Tweedie (1999) to obtain bounds for
the total variation distance [see also Rosenthal (2002)]. It is worthwhile to
note that (see the discussion above) the rate of convergence in $V$-norm is
the same as the rate of convergence in total variation.

Remark 2. It may be checked that if the sets {V <d} are 1-small for 
all d> c, then assumption (S) is always satisfied for large enough d [see 
Roberts and Tweedie (1999), discussion following Theorem 5.2]. 

We summarize the discussion above in the following theorem. 

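Theorem 5. Assume (S). Then, for all $j \in \{1, \dots, n+1\}$ and for all
initial probability measures $\xi$ and $\xi'$ on $X$,

$$\|\xi P^n - \xi' P^n\|_{TV} \le 2(1-\varepsilon)^j 1(j \le n) + \lambda^n B^{j-1}(\xi(V) + \xi'(V)),$$

$$\|\xi P^n - \xi' P^n\|_V \le 2(1-\varepsilon)^j \big(b(1-\lambda)^{-1} + \lambda^n(\xi(V) + \xi'(V))/2\big) 1(j \le n) + \lambda^n B^{j-1}(\xi(V) + \xi'(V)),$$

where $\lambda = \lambda_c + b_c/(1+c)$ and

$$B = 1 \vee \Big((1-\varepsilon)\lambda^{-1} \sup_{x \in C} RV(x)\Big).$$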

1.5. Example. We conclude this section with a simple example that shows
a situation where we can exploit the additional degree of flexibility brought
by $(1,\varepsilon)$-coupling sets. Consider the Markov chain on $\mathbb{R}^d$ defined for $k \in \mathbb{Z}^+$
by

$$X_{k+1} = g(X_k) + Z_k,$$

where:

1. $g$ is a Lipschitz function over $\mathbb{R}^d$ for some norm $\|\cdot\|$ with Lipschitz constant

$$\|g\|_{Lip} = \sup_{x \ne y} \frac{\|g(x) - g(y)\|}{\|x - y\|}.$$

2. $(Z_k, k \ge 0)$ is a sequence of independent and identically distributed
random vectors with density $q$ w.r.t. Lebesgue measure on $\mathbb{R}^d$. In addition,
$q$ is positive and continuous.

It is known [see, e.g., Doukhan and Ghindes (1980)] that under these
assumptions the Markov chain is positive recurrent and thus has a unique
invariant distribution. Define, for $\delta > 0$,

(22) $$\bar C(\delta) := \{(x,x') \in \mathbb{R}^d \times \mathbb{R}^d : \|x - x'\| \le \delta\}.$$


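Using $a \wedge b = (1/2)((a+b) - |a-b|)$, it is easily shown that for all $(x,x') \in
\bar C(\delta)$ and all $A \in \mathcal{B}(\mathbb{R}^d)$,

$$P(x,A) \wedge P(x',A) \ge \frac{1}{2}\int_A \big(q(z - g(x)) + q(z - g(x')) - |q(z - g(x)) - q(z - g(x'))|\big)\, dz$$

and thus $P(x,A) \wedge P(x',A) \ge \varepsilon(\delta)\nu_{x,x'}(A)$ with

(23) $$\nu_{x,x'}(A) = \frac{\int_A \big(q(z - g(x)) + q(z - g(x')) - |q(z - g(x)) - q(z - g(x'))|\big)\, dz}{2 - \int |q(z - g(x)) - q(z - g(x'))|\, dz}, \qquad \varepsilon(\delta) = 1 - \frac{1}{2} \sup_{(x,x') \in \bar C(\delta)} \int |q(z - (g(x) - g(x'))) - q(z)|\, dz.$$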
{x,x')&C(S) J 
Note that for all $(x,x') \in \bar C(\delta)$, $\|g(x) - g(x')\| \le \|g\|_{Lip}\|x - x'\| \le \|g\|_{Lip}\delta$.
Since the function $u \mapsto \int |q(z - u) - q(z)|\, dz$ is continuous and $q$ is everywhere
positive, for all $\delta > 0$, the set $\bar C(\delta)$ is a $(1, \varepsilon(\delta))$-coupling set.
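As an illustration of (23), consider the one-dimensional case with $q$ the standard normal density; this specific choice of $q$ is an assumption made for the sketch, not part of the example in the text. Then $\int |q(z-u) - q(z)|\,dz = 2\,\mathrm{erf}(|u|/(2\sqrt{2}))$, which is increasing in $|u|$, so $1 - \mathrm{erf}(\|g\|_{Lip}\delta/(2\sqrt 2))$ is a lower bound on $\varepsilon(\delta)$ (with equality when $g$ is linear with slope $\|g\|_{Lip}$). A lower bound is enough for the minorization (1). The numerical integrator is included only as a sanity check of the closed form.

```python
import math


def _normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def shift_l1_exact(u: float) -> float:
    """Closed form of int |q(z - u) - q(z)| dz for the standard normal density q."""
    return 2.0 * math.erf(abs(u) / (2.0 * math.sqrt(2.0)))


def shift_l1_numeric(u: float, half_width: float = 12.0, steps: int = 20000) -> float:
    """Midpoint-rule approximation of the same integral (sanity check only)."""
    lo, hi = -half_width, half_width + abs(u)
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        z = lo + (k + 0.5) * h
        total += abs(_normal_pdf(z - abs(u)) - _normal_pdf(z))
    return total * h


def eps_delta(lip: float, delta: float) -> float:
    """Lower bound on eps(delta) in (23) for Gaussian noise, using |g(x)-g(x')| <= lip*delta."""
    if lip < 0.0 or delta < 0.0:
        raise ValueError("lip and delta must be nonnegative")
    return 1.0 - 0.5 * shift_l1_exact(lip * delta)
```

Larger $\delta$ enlarges the coupling set but lowers the coupling probability, which is the trade-off that the choice of $\delta$ has to resolve.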

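Let $\delta > 0$. For all $(x,x') \in \mathbb{R}^d \times \mathbb{R}^d$ and all $A, A' \in \mathcal{B}(\mathbb{R}^d)$, define $\bar P$ by

$$\bar P(x,x'; A \times A') = \int 1_A(g(x) + z) 1_{A'}(g(x') + z) q(z)\, dz$$

and let, for $(x,x') \in \bar C(\delta)$,

$$\bar R_\delta(x,x'; A \times A') = (1 - \varepsilon(\delta))^{-1}\big(\bar P(x,x'; A \times A') - \varepsilon(\delta)\nu_{x,x'}(A \cap A')\big).$$

It is easily checked that $\bar R_\delta$ and $\bar P$ satisfy (2) and (4), respectively. Finally,
define $P^*_\delta$ as in (6).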

We now determine an explicit bound for the total variation distance. Put
$\bar V(x,x') = 1 + \|x - x'\|$. Note that for all $(x,x') \in \mathbb{R}^d \times \mathbb{R}^d$,

$$\bar P\bar V(x,x') = 1 + \|g(x) - g(x')\| \le 1 + \|g\|_{Lip}\|x - x'\|.$$

Choose $\lambda$ such that $\|g\|_{Lip} < \lambda < 1$. By construction, for all $(x,x') \notin \bar C(\delta)$, we
have $\|x - x'\| \ge \delta$. Hence, for any $\delta > (1-\lambda)/(\lambda - \|g\|_{Lip})$ and all $(x,x') \notin \bar C(\delta)$,
we have

$$\begin{aligned} 1 + \|g\|_{Lip}\|x - x'\| &= \lambda(1 + \|x - x'\|) + (1 - \lambda - (\lambda - \|g\|_{Lip})\|x - x'\|) \\ &\le \lambda(1 + \|x - x'\|) + (1 - \lambda - (\lambda - \|g\|_{Lip})\delta) \\ &\le \lambda(1 + \|x - x'\|). \end{aligned}$$

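It remains to prove that $\sup_{(x,x') \in \bar C(\delta)} \bar R_\delta\bar V(x,x') < \infty$. Note that

$$\sup_{(x,x') \in \bar C(\delta)} \bar R_\delta\bar V(x,x') \le \frac{\sup_{(x,x') \in \bar C(\delta)} \bar P\bar V(x,x') - \varepsilon(\delta)}{1 - \varepsilon(\delta)} \le \frac{1 + \|g\|_{Lip}\delta - \varepsilon(\delta)}{1 - \varepsilon(\delta)}.$$

Summarizing our findings, for any $\lambda$ with $\|g\|_{Lip} < \lambda < 1$ and any $\delta > (1 -
\lambda)/(\lambda - \|g\|_{Lip})$, (A1) is satisfied with $\varepsilon := \varepsilon(\delta)$ and (A2) is satisfied with
$\bar V(x,x') = 1 + \|x - x'\|$. We may thus apply Theorem 2 to obtain a total
variation distance bound as follows. [Note that with this choice of bivariate
drift function $\bar V$ we may only compute a total variation bound; the condition
$f(x) + f(x') \le 2(1 + \|x - x'\|)$ indeed implies that $f \le 1$.]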


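Proposition 6. For all $\lambda$ such that $\|g\|_{Lip} < \lambda < 1$, for all $\delta > (1 -
\lambda)/(\lambda - \|g\|_{Lip})$, for all $j \in \{1, \dots, n+1\}$ and for all initial probability
measures $\xi$ and $\xi'$ on $X$,

$$\|\xi P^n - \xi' P^n\|_{TV} \le 2(1 - \varepsilon(\delta))^j 1(j \le n) + 2\lambda^n B^{j-1}\Big(1 + \iint \xi(dx)\xi'(dx')\|x - x'\|\Big),$$

where $\varepsilon(\delta)$ is defined in (23) and

$$B = 1 \vee \{\lambda^{-1}(1 + \|g\|_{Lip}\delta - \varepsilon(\delta))\}.$$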
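The bound of Proposition 6 is fully explicit once $\|g\|_{Lip}$, $\lambda$, $\delta$ and $\varepsilon(\delta)$ are fixed. The sketch below evaluates it and minimizes over $j$; the function and argument names are illustrative, `eps` is assumed to be a valid value (or lower bound) of $\varepsilon(\delta)$ computed separately, and `mean_dist` stands for $\iint \xi(dx)\xi'(dx')\|x - x'\|$.

```python
def prop6_tv_bound(n: int, lip: float, lam: float, delta: float,
                   eps: float, mean_dist: float) -> float:
    """Evaluate the bound of Proposition 6, minimized over j in {1, ..., n+1}."""
    if not (0.0 <= lip < lam < 1.0):
        raise ValueError("need ||g||_Lip < lam < 1")
    if delta <= (1.0 - lam) / (lam - lip):
        raise ValueError("need delta > (1 - lam) / (lam - ||g||_Lip)")
    if not (0.0 < eps <= 1.0) or mean_dist < 0.0:
        raise ValueError("need 0 < eps <= 1 and mean_dist >= 0")
    B = max(1.0, (1.0 + lip * delta - eps) / lam)
    v0 = 1.0 + mean_dist                      # xi (x) xi' applied to 1 + ||x - x'||
    best = float("inf")
    drift = 2.0 * lam ** n * v0               # second term at j = 1
    for j in range(1, n + 2):
        coupling = 2.0 * (1.0 - eps) ** j if j <= n else 0.0
        best = min(best, coupling + drift)
        if drift >= best:
            break                             # second term is nondecreasing in j
        drift *= B
    return best
```

For a fixed $\lambda$, increasing $\delta$ beyond the threshold $(1-\lambda)/(\lambda - \|g\|_{Lip})$ lowers $\varepsilon(\delta)$ and raises $B$, so in practice one would scan a few $(\lambda, \delta)$ pairs and keep the smallest bound.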
2. Time-inhomogeneous case. We now proceed to extend Theorem 2 to
time-inhomogeneous chains. Specifically, we consider a family $\{P_k, k \ge 1\}$ of
Markov transition kernels. That is, we allow $P_k(x,A)$ to depend not only
on the starting point $x$ and the target subset $A$, but also on the time
parameter $k$. For example, this would be the case for simulated annealing and
hidden Markov models; a specific example is discussed in Section 3.

2.1. Assumptions and lemma. The assumptions and notations parallel
those from the time-homogeneous case. We first assume the following
minorization condition.

(NS1) There exist a sequence $(\bar C_k, k \ge 1)$ of subsets of $X \times X$,
a sequence $(\varepsilon_k, k \ge 1)$, $\varepsilon_k > 0$, and a family of probability measures
$\{\nu_{k,x,x'}, (x,x') \in \bar C_k, k \ge 1\}$ such that

$$P_k(x,\cdot) \wedge P_k(x',\cdot) \ge \varepsilon_k\nu_{k,x,x'}(\cdot).$$

Let $\{\bar P_k, k \ge 1\}$ be a family of transition kernels that satisfy, for all $k$, the
analog of (4) with $P = P_k$ and let $\{\bar R_k, k \ge 0\}$ be a family of transition kernels
that verify, for all $k$, the analog of (2) with $P = P_k$, $\nu_{x,x'} = \nu_{k,x,x'}$, $\varepsilon = \varepsilon_k$ and
$\bar C = \bar C_k$. The proof is based on a straightforward adaptation of the coupling
construction used in the homogeneous case. For $n \ge 0$, if $(X_n, X_n') \in \bar C_{n+1}$
and $d_n = 0$, flip a coin with probability of success $\varepsilon_{n+1}$. If the coin comes up
heads, then draw $X_{n+1}$ from $\nu_{n+1,X_n,X_n'}$ and set $X_{n+1}' = X_{n+1}$ and $d_{n+1} = 1$.
Otherwise, draw $(X_{n+1}, X_{n+1}')$ from $\bar R_{n+1}(X_n, X_n'; \cdot)$ and set $d_{n+1} = 0$. If
$(X_n, X_n') \notin \bar C_{n+1}$ and $d_n = 0$, then draw $(X_{n+1}, X_{n+1}')$ from $\bar P_{n+1}(X_n, X_n'; \cdot)$
and set $d_{n+1} = 0$. Finally, define $\{P^*_k, k \ge 0\}$ to be the family of transition
kernels defined as the analog of (6). For $\mu$ a probability measure on $X \times X$,
denote by $P^*_\mu$ and $E^*_\mu$ the probability and the expectation induced by the Markov
chain with initial distribution $\mu$ and transition kernels $(P^*_k, k \ge 0)$.

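Lemma 7. Assume (NS1) and let $f : X \to [1, +\infty)$. For any probability
measures $\xi$, $\xi'$ on $\mathcal{B}(X)$, for any $n \ge 1$,

$$\|\xi P_1 \cdots P_n - \xi' P_1 \cdots P_n\|_f \le E^*_{\xi\otimes\xi'}\Big\{(f(X_n) + f(X_n')) \prod_{i=1}^{n} (1 - \varepsilon_i 1_{\bar C_i}(\bar X_{i-1}))\Big\},$$

where $\bar X_i = (X_i, X_i')$.

The proof can be adapted from Lemma 1 and (9). We also assume the
following drift condition: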
(NS2) There exist a family of functions $\{\bar V_k\}_{k \ge 0}$, $\bar V_k : X \times X \to [1,\infty)$, and
two sequences $(\lambda_k, k \ge 0)$, $0 < \lambda_k < 1$ for all $k \ge 1$, and $(b_k, k \ge 0)$,
such that

(25) < Afci4 + V A; > 0. 

Define for j € {1,..., A;}, 

j j 

{l-e)j^k-= max TT(l-efc) and Bjk-= max TT 

1 = 1 1 = 1 

where, for any integer k, 

(26) Bk:=ly ({I-£k)( sup RkVkix,x')]Xk\\ 

V \(x,x')eCk ) X 

By convention, we set B^^k = 1 for all k. 

2.2. Main time-inhomogeneous result. We can now state our main result,
as follows.

Theorem 8. Assume (NS1) and (NS2). Let $(f_k, k \ge 0)$ be a family
of functions such that, for all $k \ge 0$, $f_k(x) + f_k(x') \le 2\bar V_k(x,x')$ for all
$(x,x') \in X \times X$. Then, for all $j \in \{1, \dots, n+1\}$ and for all initial probability
measures $\xi$ and $\xi'$,


(27) 


(28) 


UPl---Pn-i'Pl---PnhY 

< 2(1 - < n) + 2 ® ?0(foo), 

< 2(1 - £)j,nDnl{j < n) + 2 ^J[ x}j 0 e')(^^o), 


where := (n(Lo^ ® ?^(^o) + Z)j=o (nr=/+i convention 

Hi=i = 1 when i> j. 


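Proof. The proof is along the same lines as for the time-homogeneous
case. Denote $N_k = \sum_{j=0}^{k} 1_{\bar C_{j+1}}(\bar X_j)$. For any $j \in \{1, \dots, n+1\}$, we have

$$\begin{aligned} E^*_{\xi\otimes\xi'}\Big\{(f_n(X_n) + f_n(X_n')) \prod_{i=1}^{n} (1 - \varepsilon_i 1_{\bar C_i}(\bar X_{i-1}))\Big\} &\le (1-\varepsilon)_{j,n} E^*_{\xi\otimes\xi'}\{f_n(X_n) + f_n(X_n')\} 1(j \le n) \\ &\quad + 2E^*_{\xi\otimes\xi'}\Big\{\bar V_n(\bar X_n) \prod_{i=1}^{n} (1 - \varepsilon_i 1_{\bar C_i}(\bar X_{i-1})) 1(N_{n-1} < j)\Big\}, \end{aligned}$$

where we have used that $\prod_{i=1}^{n} (1 - \varepsilon_i 1_{\bar C_i}(\bar X_{i-1})) 1(N_{n-1} \ge j) \le (1-\varepsilon)_{j,n}$.

When $f_n \equiv 1$,

$$E^*_{\xi\otimes\xi'}\{f_n(X_n) + f_n(X_n')\} = 2.$$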

Otherwise, 

EW{(/n(^n) + /n(XO)} < 2E|^^,{K(X„)} < 

Now, since by definition Bj > 1 [see (26)], we have Bj^n < ^j',n for all 0 < 
j < j' <n and 


which implies that 

I' n '' 

EW Vn{Xn) n(l - eilc,iX^-l))liNn-l<j) 


(29) 


2 = 1 


/n—1 




where, for s > 0, 

/s—1 \ ~^ ^ 

(30) M, := n 11(1 - e,lc^{X,_,))Vs{X,,X'). 

\j=o / j=l 


As above, $(M_s, s \ge 0)$ is an $(\mathcal{F}, P^*_{\xi\otimes\xi'})$ supermartingale, where $\mathcal{F} :=
\{\mathcal{F}_s := \sigma(\bar X_j, 0 \le j \le s), s \ge 0\}$, which concludes the proof. □


3. Application to simulated annealing. In this section, we apply the 
results above to study the convergence of the simulated annealing (SA) 
algorithm for continuous global optimization [see Locatelli (2001, 2002), 
Fouskakis and Draper (2001), Andrieu, Breyer and Doucet (2001) and the 
references therein]. 


3.1. Assumptions. Let $f$ be a function defined on $\mathbb{R}$, and let $\mathcal{M}$ be the
set of global minima of $f$ (to keep the discussion simple, multidimensional
versions are not considered here). We make the following assumptions:

(SA0) The function $f$ is twice continuously differentiable and there exist
$a > 0$, $x_1 \in \mathbb{R}$, such that, for all $y \ge x \ge x_1$,

(31) $$f(y) - f(x) \ge a(y - x)$$

and similarly, for all $y \le x \le -x_1$,

(32) $$f(y) - f(x) \ge a(x - y).$$
(32) 
(SA1) For each $x \in \mathcal{M}$, we have $f''(x) > 0$.

Under (SA0), $\mathcal{M} \subset [-x_1, x_1]$, that is, the set of global minima of $f$ is
contained in the interval $[-x_1, x_1]$. Assumption (SA1) implies that the global
minima are isolated and thus that the set $\mathcal{M}$ is finite. Assumption (SA0)
implies that for all $\gamma > 0$, $\int \exp(-\gamma f(y))\mu^{Leb}(dy) < \infty$, where $\mu^{Leb}$ is the
Lebesgue measure over $\mathbb{R}$.

Consider a candidate transition kernel, $Q(x,A)$, $x \in \mathbb{R}$, $A \in \mathcal{B}(\mathbb{R})$, which
generates potential transitions for a discrete time Markov chain evolving
on $\mathbb{R}$. We focus on the case where the candidate points are proposed from
a random walk with increment distribution that has a density $q$ with
respect to $\mu^{Leb}$: $Q(x,A) = \int_A q(y - x)\mu^{Leb}(dy)$, $A \in \mathcal{B}(\mathbb{R})$. In addition, make
the following assumption:

(SA2) The proposal density $q$ is continuous, strictly positive and
symmetric: $q(y) > 0$ and $q(y) = q(-y)$.

3.2. The random walk Metropolis-Hastings algorithm. The random walk
Metropolis-Hastings (RWMH) algorithm corresponds to the Hastings-Metropolis
algorithm introduced by Metropolis, Rosenbluth, Rosenbluth, Teller and Teller
(1953) and Hastings (1970). It proceeds as follows to sample from the
(unnormalized) distribution $\exp(-\gamma f(x))\mu^{Leb}(dx)$ for $\gamma > 0$. (For RWMH, the
"inverse temperature" parameter $\gamma$ is held constant. We see later that with
simulated annealing, by contrast, $\gamma$ is modified at each iteration of the
algorithm.)

Given the current state $x$, a candidate new state $y$ is chosen according to
the law $Q(x,\cdot)$. This candidate $y$ is then accepted with probability $\alpha_\gamma(x,y)$,
where

$$\alpha_\gamma(x,y) = 1 \wedge \exp(-\gamma(f(y) - f(x))).$$

The RWMH kernel is thus given by

$$K_\gamma(x,A) = \int_A \alpha_\gamma(x,y) q(y - x)\mu^{Leb}(dy) + \delta_x(A) \int (1 - \alpha_\gamma(x,y)) q(y - x)\mu^{Leb}(dy), \qquad A \in \mathcal{B}(\mathbb{R}).$$

It then follows that $\pi_\gamma(\cdot)$ is a stationary distribution for $K_\gamma$, where

(33) $$\pi_\gamma(A) = \frac{\int_A \exp(-\gamma f(x))\mu^{Leb}(dx)}{\int_{\mathbb{R}} \exp(-\gamma f(x))\mu^{Leb}(dx)} \qquad \forall A \in \mathcal{B}(\mathbb{R}).$$


The RWMH algorithm on $\mathbb{R}$ was extensively studied by Mengersen and Tweedie
(1996), who showed that the transition kernels $K_\gamma$ are $\pi_\gamma$-irreducible (Lemma 1.1)
and that all the compact sets are small (Lemma 1.2).
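A minimal sketch of one RWMH transition for the target $\exp(-\gamma f(x))$ follows. The Gaussian increment is an assumption made for the illustration (it is continuous, strictly positive and symmetric, so it satisfies (SA2)); the function names and the `scale` parameter are hypothetical.

```python
import math
import random
from typing import Callable, List


def acceptance(f: Callable[[float], float], gamma: float, x: float, y: float) -> float:
    """alpha_gamma(x, y) = 1 ^ exp(-gamma (f(y) - f(x)))."""
    diff = f(y) - f(x)
    if diff <= 0.0:
        return 1.0                       # downhill moves are always accepted
    return math.exp(-gamma * diff)


def rwmh_step(rng: random.Random, f: Callable[[float], float],
              gamma: float, x: float, scale: float = 1.0) -> float:
    """One draw from K_gamma(x, .) with a Gaussian random walk proposal."""
    y = x + rng.gauss(0.0, scale)        # candidate from Q(x, .)
    if rng.random() < acceptance(f, gamma, x, y):
        return y                         # accept the candidate
    return x                             # reject: stay at x


def rwmh_chain(rng: random.Random, f: Callable[[float], float], gamma: float,
               x0: float, n: int, scale: float = 1.0) -> List[float]:
    """Run n RWMH steps from x0 and return the visited states."""
    path, x = [], x0
    for _ in range(n):
        x = rwmh_step(rng, f, gamma, x, scale)
        path.append(x)
    return path
```

With $f(x) = x^2$ the target is a centered Gaussian with variance $1/(2\gamma)$, which gives a simple way to check a run empirically.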





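Lemma 9. Assume (SA0)-(SA2). Then, for every compact subset $C$ of
$\mathbb{R}$ such that $\mu^{Leb}(C) > 0$, we have for all $x \in C$, $K_\gamma(x,A) \ge \varepsilon_\gamma\nu_\gamma(A)$ with

(34) $$\varepsilon_\gamma := \varepsilon e^{-\gamma d}\mu^{Leb}(C) \quad\text{and}\quad \nu_\gamma(A) := \frac{\mu^{Leb}(A \cap C)}{\mu^{Leb}(C)},$$

where

(35) $$d := \sup_{x \in C} f(x) - \inf_{x \in C} f(x) \quad\text{and}\quad \varepsilon := \inf_{(x,y) \in C \times C} q(y - x) > 0.$$

Proof. For all $x \in C$,

$$K_\gamma(x,A) \ge \int_{A \cap C} \big(e^{-\gamma(f(y) - f(x))} \wedge 1\big) q(y - x)\mu^{Leb}(dy) \ge \varepsilon e^{-\gamma d}\mu^{Leb}(A \cap C).$$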


To apply Theorem 8, we need to find drift functions that satisfy drift
conditions outside the compact sets of $\mathbb{R}$. The existence of drift functions for
the RWMH algorithm was shown by Mengersen and Tweedie [(1996),
Theorem 3.2]. The proposition below relaxes some of the assumptions required
in their result, and shows that the same drift function can be taken for all
the Markov kernels $K_\gamma$ for large enough $\gamma$. For $0 < s < \gamma$, let $V_s(x) := e^{sf(x)}$
and

(36) $$r(\gamma,s) := 1 + \frac{s}{\gamma}\Big(1 - \frac{s}{\gamma}\Big)^{(\gamma - s)/s}.$$
Proposition 10. Assume (SA0)-(SA2). Then, for all $\beta$ such that $1/2 <
\beta < 1$, there exist $\bar x < \infty$, $\bar\gamma > 0$ and $s > 0$ such that:

(i) $K_\gamma V_s(x)/V_s(x) \le r(\gamma,s)$ for all $x \in \mathbb{R}$ and $\gamma > 0$;

(ii) $K_\gamma V_s(x)/V_s(x) \le \beta$ for all $|x| \ge \bar x$ and $\gamma \ge \bar\gamma$.

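Proof. By (33) and using that $V_s(y) = e^{sf(y)}$, we have, for $\gamma > s > 0$,

(37) $$\frac{K_\gamma V_s(x)}{V_s(x)} = \int \varphi_{\gamma,s}\big(e^{-(f(y) - f(x))}\big) q(y - x)\mu^{Leb}(dy),$$

where $\varphi_{\gamma,s}(u) := u^{-s}(u^\gamma \wedge 1) + 1 - (u^\gamma \wedge 1)$. We easily check that, for all
$u > 0$,

(38) $$\varphi_{\gamma,s}(u) \le \varphi_{\gamma,s}\Big(\Big(\frac{\gamma - s}{\gamma}\Big)^{1/s}\Big) = r(\gamma,s),$$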
which proves the first assertion of the proposition. Now, for any $\varepsilon > 0$, we
prove that there exists some $\bar x$ such that


lim sup 

'r^°°x>x 


K^Vsjx) 

Vs{x) 




The proof of the corresponding inequality where x > x is replaced by x < —x 
follows the same lines. Choose M > 0 such that 

f-M 

/ <e/2. 

J — OO 


Inserting this inequality into (37), where z = y — x, and using (38) yields 


K^Vsix) 

Vs{x) 


< / ^'yA^ 

J-M 


-(f{x+z)-f{x 


^'>)qiz)A'''°{dz) 


For all X > X := xi + M and all —M < z <0, we have by assumption (SAG), 
exp(—(/(x + z) — /(x))) > exp(—aa:) > 1 and since ip^AA — 


K^Vsjx) 

Vs{x) 


< 


r e'^^^q{z)A"'"{dz) + rA,s) 

J-M 


1 + e 


Now, choose $s$ sufficiently large so that the first term on the right-hand side
is less than $\varepsilon/2$. Once $s$ is chosen, we easily check that $\lim_{\gamma\to\infty} r(\gamma,s) = 1$.
This proves the second assertion. □
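The role of $r(\gamma,s)$ in the proof can be checked numerically: it is the maximum over $u > 0$ of $\varphi_{\gamma,s}(u) = u^{-s}(u^\gamma \wedge 1) + 1 - (u^\gamma \wedge 1)$, attained at $u = ((\gamma - s)/\gamma)^{1/s}$, and it tends to 1 as $\gamma \to \infty$ for fixed $s$. The sketch below uses the closed form $r(\gamma,s) = 1 + (s/\gamma)(1 - s/\gamma)^{(\gamma-s)/s}$ obtained from that maximizer; the function names are illustrative.

```python
def phi(u: float, gamma: float, s: float) -> float:
    """phi_{gamma,s}(u) = u^{-s} (u^gamma ^ 1) + 1 - (u^gamma ^ 1), for u > 0."""
    a = min(u ** gamma, 1.0)
    return u ** (-s) * a + 1.0 - a


def maximiser(gamma: float, s: float) -> float:
    """Point u* = ((gamma - s) / gamma)^(1/s) at which phi is maximal."""
    return ((gamma - s) / gamma) ** (1.0 / s)


def r(gamma: float, s: float) -> float:
    """Maximum of phi over u > 0, valid for 0 < s < gamma."""
    if not 0.0 < s < gamma:
        raise ValueError("need 0 < s < gamma")
    return 1.0 + (s / gamma) * (1.0 - s / gamma) ** ((gamma - s) / s)
```

For example, `r(2.0, 1.0)` equals 1.25 and `r(1000.0, 1.0)` is already within 0.001 of 1, which illustrates why the drift ratio can be pushed below any $\beta > 1/2$ only after the bulk of the proposal mass is controlled separately.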


Define $\bar K_\gamma(x,x'; A \times A') = K_\gamma(x,A) K_\gamma(x',A')$ and, for $s > 0$, $\bar V_s(x,x') =
(1/2)(V_s(x) + V_s(x'))$.


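Proposition 11. Assume (SA0)-(SA2). For all $s > 0$ and for all $c > 0$,
$\{V_s \le c\}$ is a compact 1-small set for $K_\gamma$. Moreover, there exist $0 < \lambda_0 < \lambda <
1$, $s > 0$, $c_0 < c$, $b$ and $\bar\gamma$ such that, for all $\gamma \ge \bar\gamma$,

(39) $$K_\gamma V_s \le \lambda_0 V_s + b 1_{\{V_s \le c_0\}},$$

(40) $$\bar K_\gamma\bar V_s \le \lambda\bar V_s + b 1_{\{V_s \le c\} \times \{V_s \le c\}}.$$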

Proof. The compactness of $\{V_s \le c\}$ is straightforward from (SA0).
Then, by Lemma 9, it is a 1-small set for $K_\gamma$. Equation (39) follows from
Proposition 10. To prove (40), write, for $c \ge c_0$,

$$\bar K_\gamma\bar V_s \le \lambda_0\bar V_s + b 1_{\{V_s \le c\} \times \{V_s \le c\}} + (b/2)\big(1_{\{V_s \le c\} \times \{V_s > c\}} + 1_{\{V_s > c\} \times \{V_s \le c\}}\big).$$

Set $0 < \lambda_0 < \lambda < 1$ and $c = (b/(\lambda - \lambda_0) - 1) \vee c_0$. We have, for all $(x,x') \in
\{V_s \le c\} \times \{V_s > c\}$,

$$b/2 \le (\lambda - \lambda_0)(1 + c)/2 \le (\lambda - \lambda_0)\bar V_s(x,x'),$$

which implies

$$(\lambda_0\bar V_s + (b/2)) 1_{\{V_s \le c\} \times \{V_s > c\}} \le \lambda\bar V_s 1_{\{V_s \le c\} \times \{V_s > c\}}.$$

This concludes the proof. □

The key point in the above result [also outlined in Andrieu, Breyer and Doucet
(2001)] is that, for large enough $\gamma$ ($\gamma \ge \bar\gamma$), all the transition kernels $\bar K_\gamma$
satisfy a drift condition outside the same small set $\{V_s \le c\} \times \{V_s \le c\}$, with
the same drift function $\bar V_s$ and the same constants $\lambda$ and $b$.

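3.3. The simulated annealing algorithm. We now consider the simulated
annealing case. Here $\gamma = \gamma_i$ depends on the iteration, and for the $i$th
iteration, the kernel $P_i = K_{\gamma_i}$ is used. Define similarly $\bar P_i = \bar K_{\gamma_i}$ and $\pi_i = \pi_{\gamma_i}$.
Denote $\bar C = \{V_s \le c\} \times \{V_s \le c\}$, with the constants $s$ and $c$ chosen to
satisfy (40). For $(x,x') \in \bar C$, set $\bar R_i(x,x'; A \times A') = R_i(x,A) R_i(x',A')$, with

(41) $$R_i(x,A) = (1 - \varepsilon_i)^{-1}(P_i(x,A) - \varepsilon_i\nu_i(A)), \qquad \varepsilon_i = \varepsilon_{\gamma_i} \quad\text{and}\quad \nu_i = \nu_{\gamma_i},$$

where $\varepsilon_\gamma$ and $\nu_\gamma$ are defined in (34). We may now state the main result of
this section.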


Theorem 12. Assume (SA0)-(SA2). For $\zeta > 0$, set

(42) $$\gamma_i = \frac{\log(i + 1)}{d(1 + \zeta)},$$

where $d$ is defined in (35). Then for any initial probability measure $\mu$, we
have

(43) $$\lim_{n\to\infty} \|\mu P_1 \cdots P_n - \pi_n\|_{TV} = 0.$$


Proof. For any $1 \le m \le n$, we have

(44) $$\|\mu P_1 \cdots P_n - \pi_n\|_{TV} \le \|(\mu P_1 \cdots P_m) P_{m+1} \cdots P_n - \pi_m P_{m+1} \cdots P_n\|_{TV} + \sum_{l=m}^{n-1} \|\pi_l P_{l+1} P_{l+2} \cdots P_n - \pi_{l+1} P_{l+1} P_{l+2} \cdots P_n\|_{TV}.$$

Let (on, n > 0) be a sequence of integers such that limsupjj^oo(o 4 ^ + ®n/n.) = 
0. Note that for sufficiently large n, 

(AL"'’(C))“^ X = ^ X e-'^P = e-T^e X (1 + 

i=n—an i=n—an i=n—an 

Hence lim^^oo Y.'i=n-a„ £i = 00 . 
From Proposition 11, we have $\sup_i \sup_{(x,x') \in \bar C} \bar R_i\bar V_s(x,x') < \infty$, and thus
there exists an integer $l$ such that $\lambda^l \sup_i \sup_{(x,x') \in \bar C} \bar R_i\bar V_s(x,x') \le \lambda$, with
$\lambda < 1$ satisfying (40). Since

$$\lambda^l \sup_i \sup_{(x,x') \in \bar C} \bar R_i\bar V_s(x,x')\lambda^{-1} \le 1,$$

Theorem 8 implies that, for all $n \ge (l+1)a_n$ and any initial distributions
$\xi$ and $\xi'$,


\\^^n—{l+l)an ’ ' ’ ? ^n—{l+l)an ’ ' ' '^^^llTV 


< 


n (1 


- Si 




<exp(- ^ ej+A“~^0^'(P^). 


To bound the first term on the right-hand side of (44), we use the
expression above with $\xi = \mu P_1 \cdots P_m$ and $\xi' = \pi_m$ with $m = n - (l+1)a_n - 1$.
Equation (39) implies that for any initial distribution $\mu$ and any integer $m$,

$$\mu P_1 \cdots P_m V_s \le \lambda_0^m \mu V_s + \frac{b}{1 - \lambda_0}.$$

Since $\pi_m P_m = \pi_m$,

$$\pi_m V_s \le \lambda_0\pi_m V_s + b \implies \pi_m V_s \le \frac{b}{1 - \lambda_0}.$$

Hence $(\mu P_1 \cdots P_m \otimes \pi_m)(\bar V_s) \le \lambda_0^m \mu V_s/2 + b/(1 - \lambda_0) < \infty$, which implies

(45) $$\lim_{n\to\infty} \|(\mu P_1 \cdots P_{n-(l+1)a_n-1}) P_{n-(l+1)a_n} \cdots P_n - \pi_{n-(l+1)a_n-1} P_{n-(l+1)a_n} \cdots P_n\|_{TV} = 0.$$

We now bound the second term on the right-hand side of (44). For any
$l \in \{1, \dots, n\}$, $\|\pi_l P_{l+1} \cdots P_n - \pi_{l+1} P_{l+1} \cdots P_n\|_{TV} \le \|\pi_l - \pi_{l+1}\|_{TV}$ and thus

$$\sum_{l=m}^{n-1} \|\pi_l P_{l+1} \cdots P_n - \pi_{l+1} P_{l+1} \cdots P_n\|_{TV} \le \sum_{l=m}^{n-1} \|\pi_l - \pi_{l+1}\|_{TV}.$$

To bound this difference we use Lemma A.1, which simplifies the argument
in Haario, Saksman and Tamminen (2001). This lemma shows that

(46) $$\sum_{l=m}^{n-1} \|\pi_l - \pi_{l+1}\|_{TV} \le 2\log(Z(\gamma_m)/Z(\gamma_n)),$$

where $Z(\gamma) = \int_{\mathbb{R}} e^{-\gamma f(x)}\mu^{Leb}(dx) / \sup_{x \in \mathbb{R}} e^{-\gamma f(x)}$. Using the Laplace formula
[see, e.g., Barndorff-Nielsen and Cox (1989)], it may be shown that

(47) $$Z(\gamma) = \Big(\frac{2\pi}{\gamma}\Big)^{1/2} \Big(\sum_{x \in \mathcal{M}} (f''(x))^{-1/2}\Big)(1 + o(1)) \quad\text{as } \gamma \to \infty,$$

where $\mathcal{M}$ is the set of global minima of $f$ (recall that, under the stated
assumptions, these minima are isolated and there are only a finite number
of them). For any integer $j$, (46) and (47) show that

(48) $$\lim_{n\to\infty} \sum_{l=n-ja_n}^{n-1} \|\pi_l - \pi_{l+1}\|_{TV} \le 2\lim_{n\to\infty} \log\Big(\frac{Z(\gamma_{n-ja_n})}{Z(\gamma_n)}\Big) \le \lim_{n\to\infty} \log\Big(\frac{\gamma_n}{\gamma_{n-ja_n}}\Big) = 0.$$

Together with (45), this concludes the proof. □
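For illustration, the simulated annealing chain of this section can be sketched as an RWMH chain whose inverse temperature follows a logarithmic cooling schedule. The sketch assumes the schedule $\gamma_i = \log(i+1)/(d(1+\zeta))$, under which $e^{-\gamma_i d} = (i+1)^{-1/(1+\zeta)}$, and a Gaussian random walk proposal; both the proposal and the function names are choices made for this example, and the value of `d` must be supplied by the user as the oscillation of $f$ over the relevant compact set.

```python
import math
import random
from typing import Callable, Tuple


def gamma_schedule(i: int, d: float, zeta: float) -> float:
    """Logarithmic schedule gamma_i = log(i + 1) / (d (1 + zeta)), for i >= 1."""
    if i < 1 or d <= 0.0 or zeta <= 0.0:
        raise ValueError("need i >= 1, d > 0 and zeta > 0")
    return math.log(i + 1) / (d * (1.0 + zeta))


def simulated_annealing(rng: random.Random, f: Callable[[float], float], x0: float,
                        n: int, d: float, zeta: float,
                        scale: float = 1.0) -> Tuple[float, float]:
    """Run n annealing steps; return (final state, best state visited)."""
    x, best = x0, x0
    for i in range(1, n + 1):
        g = gamma_schedule(i, d, zeta)
        y = x + rng.gauss(0.0, scale)            # symmetric random walk proposal
        diff = f(y) - f(x)
        if diff <= 0.0 or rng.random() < math.exp(-g * diff):
            x = y                                # Metropolis acceptance at level g
        if f(x) < f(best):
            best = x
    return x, best
```

The theorem concerns the law of the final state; the best visited state is returned as well only because it is what one would usually report in an optimization run.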


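APPENDIX A: TECHNICAL LEMMAS

Lemma A.1. Let $h$ be a nonnegative function on a measure space
$(X, \mathcal{B}(X), \mu)$. Assume that $0 < \int h^\gamma\, d\mu < \infty$ for all $\gamma \ge \gamma_0 > 0$ and that
$\|h\|_\infty = \operatorname{ess\,sup}_\mu h(x) := \inf\{M : \mu\{x : h(x) > M\} = 0\} < \infty$. For $\gamma \ge \gamma_0$,
denote by $\mu_\gamma$ the measure over $X$ with probability density function
$h^\gamma / \int h^\gamma\, d\mu$ w.r.t. $\mu$. Then, for $\gamma' \ge \gamma \ge \gamma_0$,

$$\|\mu_\gamma - \mu_{\gamma'}\|_{TV} \le 2\log\Big(\frac{Z(\gamma)}{Z(\gamma')}\Big), \qquad\text{where } Z(\gamma) := \int (h/\|h\|_\infty)^\gamma\, d\mu.$$

Proof. Scheffé's identity shows that

$$\|\mu_\gamma - \mu_{\gamma'}\|_{TV} = \int |f - g|\, d\mu,$$

where $f = h^\gamma / \int h^\gamma\, d\mu$ and $g = h^{\gamma'} / \int h^{\gamma'}\, d\mu$. Note that $f/\|f\|_\infty = (h/\|h\|_\infty)^\gamma \ge
g/\|g\|_\infty = (h/\|h\|_\infty)^{\gamma'}$, $\mu$-a.e. and $\|g\|_\infty/\|f\|_\infty = Z(\gamma)/Z(\gamma')$. The proof
follows from Lemma A.2, which may be of independent interest. □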


Lemma A.2. Let $f$ and $g$ be two probability density functions w.r.t.
a common dominating measure $\mu$ on $(X, \mathcal{B}(X))$. Assume that $\|f\|_\infty < \infty$
and $\|g\|_\infty < \infty$, and $f(x)/\|f\|_\infty \ge g(x)/\|g\|_\infty$, $\mu$-a.s. Then

$$\int |f - g|\, d\mu \le 2\log(\|g\|_\infty/\|f\|_\infty).$$
Proof. Using the inequality $(\|f\|_\infty/\|g\|_\infty) g \le f \wedge g$ and $|f - g| = f + g -
2(f \wedge g)$, we have

$$\int |f - g|\, d\mu \le 2\Big(1 - \frac{\|f\|_\infty}{\|g\|_\infty}\Big)$$

and the proof follows from the inequality

$$1 - x \le \log(1/x) \quad\text{for } x > 0.$$


□ 
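Lemma A.1 is easy to check numerically on a finite set with counting measure, where the tempered densities are obtained by raising a positive vector $h$ to the power $\gamma$ and normalizing. In the sketch below the normalizing quantity is taken to be $Z(\gamma) = \sum_x (h(x)/\max h)^\gamma$, consistent with the ratio $\|g\|_\infty/\|f\|_\infty$ appearing in the proof; the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def tempered(h: Sequence[float], gamma: float) -> List[float]:
    """Probability vector proportional to h^gamma (counting measure)."""
    w = [v ** gamma for v in h]
    total = sum(w)
    return [v / total for v in w]


def l1_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation norm in the paper's convention: sum of |p - q|."""
    return sum(abs(a - b) for a, b in zip(p, q))


def Z(h: Sequence[float], gamma: float) -> float:
    """Z(gamma) = sum_x (h(x) / max h)^gamma."""
    m = max(h)
    return sum((v / m) ** gamma for v in h)


def lemma_a1_sides(h: Sequence[float], gamma: float, gamma_prime: float) -> Tuple[float, float]:
    """Return (||mu_gamma - mu_gamma'||_TV, 2 log(Z(gamma) / Z(gamma'))), gamma' >= gamma."""
    lhs = l1_distance(tempered(h, gamma), tempered(h, gamma_prime))
    rhs = 2.0 * math.log(Z(h, gamma) / Z(h, gamma_prime))
    return lhs, rhs
```

Because the bound telescopes along an increasing sequence of $\gamma$ values, summing the left-hand sides over consecutive pairs is controlled by a single logarithm, which is how the lemma is used for the annealing schedule.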